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OPTIONAL POLYA TREE AND BAYESIAN INFERENCE 

By Wing H. Wong 1 and Li Ma 2 

Stanford University 

We introduce an extension of the Polya tree approach for con- 
structing distributions on the space of probability measures. By us- 
ing optional stopping and optional choice of splitting variables, the 
construction gives rise to random measures that are absolutely con- 
tinuous with piecewise smooth densities on partitions that can adapt 
to fit the data. The resulting "optional Polya tree" distribution has 
large support in total variation topology and yields posterior distri- 
butions that are also optional Polya trees with computable parameter 
values. 

1. Introduction. Ferguson [7] formulated two criteria for desirable prior 
distributions on the space of probability measures: (i) The support of the 
prior should be large with respect to a suitable topology, and (ii) the corre- 
sponding posterior distribution should be analytically manageable. Extend- 
ing the work by Freedman [9] and Fabius [6], he introduced the Dirichlet 
process as a prior that satisfies these criteria. Specifically, assuming for sim- 
plicity that the parameter space is a bounded interval of real numbers, 
and the base measure in the Dirichlet process prior is the Lebesgue measure, 
then the prior will have positive probability in all weak neighborhoods of 
any absolutely continuous probability measure, and given i.i.d. observations, 
the posterior distribution is also a Dirichlet process with its base measure 
obtainable from that of the prior by the addition of delta masses at the 
observed data points. 

While these properties made it an attractive prior in many Bayesian non- 
parametric problems, the use of the Dirichlet process prior is limited by 
its inability to generate absolutely continuous distributions; that is, a ran- 
dom probability measure sampled from the Dirichlet process prior is almost 
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surely a discrete measure [1, 2, 7]. Thus in applications that require the ex- 
istence of densities under the prior, such as the estimation of a density from 
a sample [16] or the modeling of error distributions in location or regression 
problems [5], there is a need for alternative ways to specify the prior. Lo [16] 
proposed an elegant prior in the space of densities by assuming the density 
is a mixture of kernel functions where the mixing distribution is modeled by 
a Dirichlet process. Under Lo's model, the random distributions are guaran- 
teed to have smooth densities and the predictive density is still analytically 
tractable. However the degree of smoothness is not adaptive. 

Another approach to deal with the discreteness problem is to use Polya 
tree priors [8] . This class of random probability measures includes the Dirich- 
let process as a special case and yet is itself a special case of the more general 
class of "tail free" processes previously studied by Preedman [9] . Polya tree 
prior satisfies Ferguson's two criteria. First, it is possible to construct Polya 
tree priors with positive probability in neighborhoods around arbitrary pos- 
itive densities [14]. Second, the posterior distribution arising from a Polya 
tree prior is available in close form [8]. Further properties and applications 
of Polya tree priors are found in [10-12, 15] and [17]. 

In this paper we study the extension of the Polya tree prior construction 
by allowing optional stopping and randomized partitioning schemes. To mo- 
tivate optional stopping, consider the standard construction of the Polya 
tree prior for probability measures in an interval f2. The interval is recur- 
sively bisected into subintervals. At each stage, the probability mass already 
assigned to an interval is randomly divided and assigned into its subintervals 
according to the independent draw of a Beta variable. However, in order for 
the prior to generate absolutely continuous measures, it is necessary for the 
parameters in the Beta distribution to increase rapidly as the depth of the 
bisection increases, that is, as we move into more and more refined levels of 
partitioning [13]. 

In any case, even when the construction yields a random distribution 
with density, with probability 1 the density will have discontinuity almost 
everywhere. The use of Beta variables with large magnitudes for its param- 
eters, although useful in forcing the random distribution to be absolutely 
continuous, has the effect of severely constraining our ability to allocate 
conditional probability to represent faithfully the data distributions within 
small intervals. To resolve this conflict between smoothness and faithfulness 
to the data distribution, one can introduce an optional stopping variable for 
each subregion obtained in the partitioning process [12]. By putting uniform 
distributions within each stopped subregion, we can achieve the goal of gen- 
erating absolutely continuous distributions without having to force the Beta 
parameters to increase rapidly. In fact, we will be able to use Jeffrey's rule 
of Beta (i, i) in the inference of conditional probabilities, regardless of the 
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depth of the subregion in the partition tree. We believe this is a desirable 
consequence of optional stopping. 

Our second extension is to allow randomized partitioning. Standard Polya 
tree construction relies on a fixed scheme for partitioning. For example in [11] 
a /c-dimensional rectangle is recursively partitioned where in each stage of the 
recursion the subregions are further divided into 2 k quadrants by bisecting 
each of the k coordinate variables. In contrast, when recursive partitioning is 
used in other statistical problems, it is customary to allow flexible choices of 
the variables to use to further divide a subregion. This allows the subregion 
to take very different shapes depending on the information in the data. The 
data-adaptive nature of the recursive partitioning is a reason for the success 
of tree-based learning methodologies such as CART [3] . Thus it is desirable 
to allow Polya tree priors to use partitions that are the result of randomized 
choices of divisions in each of the subregions at each stage of the recursion. 
Once the partitioning is randomized in the prior, the posterior distribution 
will give more weights on those partitions that provide better fits to the data. 
In this way the data is allowed to influence the choice of the partitioning. 
This will be especially useful in high-dimensional applications. 

In Section 2 we introduce the construction of "Optional Polya trees" that 
allow optional stopping and randomized partitioning. It is shown that this 
construction leads to priors that give absolutely continuous distributions 
almost surely. We also show how to specify the prior so that it has positive 
probability in all total variation neighborhoods in the space of absolutely 
continuous distributions on £1. In Section 3 we show that the use of optional 
Polya tree priors will lead to posterior distributions that are also optional 
Polya trees. We present a recursive algorithm for the computation of the 
parameters governing the posterior optional Polya tree. These results ensure 
that Ferguson's two criteria are satisfied by optional Polya tree priors, but 
now on the space of absolutely continuous probability measures. In this 
section, we also show that the posterior Polya tree is weakly consistent in 
the sense that asymptotically it concentrates all its probability in any weak 
neighborhood of a true distribution whose density is bounded. In Section 4, 
we develop and test the optional Polya tree approach to density estimation 
in Euclidean space. Concluding remarks are given in Section 5. 

We end this introduction with brief remarks on related works. The im- 
portant idea of early stopping was first introduced by Hutter [12]. Ways to 
attenuate the dependency of Polya trees on the partition include mixing the 
base measure used to define the tree [11, 14, 15], random perturbation of the 
dividing boundary in the partition of intervals [19] and the use of positively 
correlated variables for the conditional probabilities at each level of the tree 
definition (Nieto-Barajas and Miiller [18]). Compared to these works, our 
approach allows not only early stopping but also randomized choices of the 
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splitting variables. This provides a much richer class of partitions than pre- 
vious models and raises the new challenge of learning the partition based 
on the observed data. We show that under mild conditions such learning 
is achievable by finite computation. We also provide a relatively complete 
mathematical foundation which represents the first theory for Bayesian den- 
sity estimation based on recursive partitioning. Although a Bayesian version 
of recursive partitioning has been proposed previously (Bayesian CART, [4]), 
it was formulated for a different problem (classification instead of density 
estimation). Furthermore, it studied mainly model specification and com- 
putational algorithm, and did not discuss the mathematical and asymptotic 
properties of the method. 

2. Optional Polya tree. We are interested in constructing random prob- 
ability measures on a space (p,,fx). Q is either finite or a bounded rectangle 
in MP. In this paper we assume for simplicity that (i is the counting measure 
in the finite case and the Lebesgue measure in the continuous case. Suppose 
that fl can be partitioned in M different ways; that is, for j = 1,2, ... , M, 

ft = VL 3 k where 0^,'s are disjoint. 
k=i 

Each fl k , called a level-1 elementary region, can in turn be divided into 
level-2 elementary regions. Assume there are ways to divide $7^; then 
for j2 = 1, . . . , Mj^ , we have 

K nJ2 

K= U <t 

k 2 =l 

In general, for any level-fc elementary region A, we assume there are M{A) 
ways to partition it; that is, for j = 1,2, ... , M{A), 

a= U 4- 

k=l 

Let A k be the set of all possible level- A; elementary regions, and = 
\Ji =1 A l . If f2 is finite, we assume that A k separates points in f2 if k is large 
enough. If is a rectangle in MP, we assume that every open set B C ft is 
approximated by unions of sets in A^ n \ that is, 3B n f B where B n is a finite 
union of disjoint regions in 
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Example 1. 

tt = {x = (xi,. . • , x p ) : Xi € {1,2}}, 
£l J k = {x: Xj = k}, k = 1 or 2, 

^fcife = ' "^J 1 = ^ 2 = ^2}, etc. 

In this example, the number of ways to partition a level-A; elementary region 
decreases as k increases. 

Example 2. 

n = {(x u x 2 ,...,x p ):xi G [0,1]} cW. 

If A is a level-A; elementary region (a rectangle), and rrij(A) is the midpoint 
of the range of Xj for A, we set A\ = {x € A : Xj < rrij (A)} and A\ = A \ A\ . 
There are exactly M{A) =p ways to partition each A, regardless of its level. 

Once a system to generate partitions has been specified as above, we 
can formally define recursive partitions as follows. A recursive partition of 
depth k is a series of decisions = (J\, J2, ■ ■ ■ , Jk) where Ji represents 
all the decisions made at level I to decide, for each region produced at the 
previous level, whether or not to stop partitioning it further and if not, which 
way to use to partition it. Once we have decided not to partition a region, 
then it will remain intact at all subsequent levels. Thus each specifies 
a partition of f2 into a subset of regions in 

We use a recursive procedure to produce a random recursive partition of 
fl and a random probability measure Q that is uniformly distributed within 
each part of the partition. Suppose after k steps of the recursion, we have 
obtained a random recursive partition 3^ and we write 

f] = T fc UT 1 fc , 

where 

1 

Tq = (J Ai is a union of disjoint Ai <E A^ X \ 
i=i 
r 

T-f = (J A\ is a union of disjoint A\ € A k . 

i=i 

The set To represents the part of f2 where the partitioning has already 
been stopped and T\ represents the complement. In addition, we have also 
obtained a random probability measure on f2 which is uniformly dis- 
tributed within each region in T fc and Tf . 
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In the (k + l)th step, we define Q^ k+l ^ by further partitioning of the 
regions in T k as follows. For each elementary region A in the above decom- 
position of T k , generate an independent random variable, 

S ~ Bernoulli(yo). 

If S = 1, stop further partitioning of A and add it to the set of stopped 
regions. If S = 0, draw J € {1, 2, ... , M(A)} according to a nonrandom vec- 
tor X(A) = (Ai, . . . ,\m(A))i called the selection probability vector, that is, 

P{J = j) = an d Xi = l.Ii J = j, apply the jth way of partitioning 

A, 

K 

A = Aj (here K depends on A and j) 
1=1 

and set Q( k+1 \Aj) = Q^ k \A)9 3 l where 0° ' = (6{, . . . , (P K ) is generated from a 
Dirichlet distribution with parameter (a\, . . . ,a J K ). The nonrandom vector 
ol 3 = at? (A) is referred to as the assignment weight vector. 

After this step, we have obtained Tq +1 and T k+l , the respective unions 
of the stopped and continuing regions. Clearly 

Q = T k+1 UT k+ \ 

Tifc + l rpk rpk-\-l rpk 

Ji 0i 1 1 <- 1 1 ■ 

The new measure Q( fc+1 ) is then defined as a refinement of Q^ k \ For B C 
T^ k+l \ we set 

q(*+i)(B) = q(*)(B). 
For B C T^ k+1 ^ where T k+1 is partitioned as 



T k+1 = \J Ai, Ai e 
j=i 



we set 



Recall that for each in the partition of T k+l , we have already generated 
its 

Q(k+i) probability. 

Let J 7 ^ be the <7-field of events generated by all random variables used 
in the first k steps; the stopping probability p = p(A) is required to be mea- 
surable with respect to F^. The specification of p(-) is called the stopping 
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rule. In this paper we are mostly interested in the case when /?(•) is an "in- 
dependent stopping rule;" that is, p{A) is a pre-specified constant for each 
possible elementary region A. However in some applications it is useful to 
let p(A) depend on Q^ k \A). 

Let _4.(°°) = (JjfcLi A k be the set of all possible elementary regions. 

Theorem 1. Suppose there is a 5 > such that with probability 1, 1 — 
5 > p(A) > 5 for any region A generated during any step in the recursive 
partitioning process. Then with probability 1, converges in variational 

distance to a probability measure Q that is absolutely continuous with respect 
to p. 

Definition 1. The random probability measure Q defined in Theorem 
1 is said to have an optional Polya tree distribution with parameters A, a 
and stopping rule p. 

Proof of Theorem 1. We only need to prove this for the case when 
O is a bounded rectangle. We can think of 's as being generated in two 
steps. 

1. Generate the nonstopped version Q*( fe ) by recursively choosing the ways 
of partitioning each level of regions but without stopping in any of the 
regions. Let J*( fc ) denote the decision made during this process in the first 
k levels of the recursion. Each realization of J*( fc ) determines a partition 
of f2 consisting of regions A G A k (not A^ h) as in the case of optional 
stopping). Let A k ( J*( fc )) = {A G A k : A is a region in the partition induced 
by J* (fc) }. If A G A k {J*^), then it can be written as 

,1 Of/'/' 4 . 

We set 

This defines Q* ( > random measure. 

2. Given the results in Step 1, generate the optional stopping variables 
S = S{A) for each region A € A k (J*^), successively for each level k = 
1,2,3,.... Then for each k, modify Q* ^ to get Q^ k) by replacing Q< k \-\A) 
with p(-\A) for any stopped region A up to level k. 

For each A G A k (J*^), let I k (A) = indicator of the event that A has not 
been stopped during the first k levels of the recursion: 

E(Q^(T^)\J*^) = e( Q* {k \A)I k {A)\r^ 

^AeA k (J*( k )) 
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= E(Q< k \A)\J*^)E(I k (A)\J*^) 
AeA k (J*W) 

<(l-5) k Yl E{Q< k \A)\J< k ^) 
= (l-5) k . 

Thus E(Q { - k) (T k )) ->■ geometrically and hence Q^ k \T^) ->■ with proba- 
bility 1. Similarly, ^(T k ) — > with probability 1. 

For any Borel set 5cO,we claim that limQ^(-B) exists with probability 
1. To see this, write 

Q( fc ) (B) = [b n T fe ) + Q (fc) (B n if) 

= at + bk] 

a/j is increasing since 

Q( k+1 \BnT k+1 ) >Q( k+1 \BnT$) 
= Q«( J Bnr fe ), 

and bk — > since Q -> with probability 1. 

Since the Borel cr-field is generated by countably many rectangles, we 
have with probability 1 that limQ^(-B) exists for all B € B. Define Q(B) 
as this limit. If Q{B) > then Q^ k \B) > for some k. Since < /j. by 
construction, we must also have fi(B) > 0. Thus Q is absolutely continuous. 

For any BeB, Q^ k) (B n T fc ) = Q(B n Tq ), and hence 

|Q (fe) (B) - Q(B)| = |Q( fc )(B HTf) - Q(B n If) | 

<2Q( fc )(r 1 fc ) — >0. 

Thus the convergence of Q< fc ) to Q is in variational distance. □ 

The next result shows that it is possible to construct optional Polya tree 
distribution with positive probability on all L\ neighborhoods of densities. 

Theorem 2. Let fl be a bounded rectangle in W p . Suppose that the con- 
dition of Theorem 1 holds and that the selection probabilities Xi(A), the 
assignment probabilities a\ (A) / Q^, aj (A)) for all i,j and A € A^°°^ are uni- 
formly bounded away from and 1. Let q = dQ/dfJ,; then for any density f 
and any t > 0, we have 

Pi [ \q(x)-f(x)\dn<r) >0. 
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Proof. First assume that / is uniformly continuous. Let 
5(e) = sup |/(s)-/(y)|; 

\x— y\<e 

then 5(e) 1 as e I 0. For any A; large enough, we can find a partitioning 
= Ui=i -^i where Aj G A fc is arrived at by k steps of recursive partitioning 
(deterministic and without stopping) and that each A has diameter < e. 

Approximate / by a step function f*(x) = // I Ai (x),f* = j A Jdji/ 
fi(Ai). Let D £ (f) be the set of step functions g(-) = J2di^Ai(') satisfying 

sup|ft-/*| <5(e). 

i 

Suppose g G D £ (f); then for any i? we have B = \Jj =1 (B n A) = |J- = i 
and 



(5 - /) ^ 



<J2\9i-ftHBi) + Yl 

i i 

<]T<5( e M^) + E r - 



fdfj, 



b, 



where 



ri = n(Bi) 



where x% G5;. Since 
we have 



IaP( x ) ~ f( x k)) d V Ib, ~ f( x k)) d V 



MA) n(Bi 
\f(x) -f(xi)\ <5(e) for x E Ai, 
\n\< 25(e) fi(Bi). 



Hence 



and thus 



B 



(a - f) dfx 



< 35(e)n(B) VB, 



\g-f\dfi<35(e)^(n)=35'(e), 

where 5'(e) = 5(e)fj,(fl). Since all probabilities in the construction of q k 
^g— are bounded away from and 1, we have 

P(q k € D e (f) for all large k) > 0. 
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Hence 



p(J \l k ~f\dn< 35' (e) for all large kj > 0. 
On the other hand, by Theorem 1, we have 



PI / \q k -q\dn^0 ) =1. 



Thus 



p(J \q-f\dii<M\e) \ >0. 

Finally, the result also holds for a discontinuous / since we can approximate 
it arbitrarily closely in L\ distance by a uniformly continuous one. □ 



It is not difficult to specify a\ (A) to satisfy the assumption of Theorem 
2. A useful choice is 

ai(A) = T k n(4)/ii(n) for ,4 G „4 fc , 

where r > is a suitable constant. 

The reason for including the factor r k when A G A k is to ensure that the 
strength of information we specified for the conditional probabilities within 
A is not diminishing as the depth of partition k increases. For example, in 
Example 2 each A is partitioned into two parts of equal volumes; that is, 

A = A{\J 4, v(A{) = M (4) = \^{A). 
Thus AeA k ^> p(4) = 2~( fc+1 V(0): and 

In this case, by choosing r = 2 we have obtained a nice "self-similarity" 
property for the optional Polya tree, in the sense that the conditional proba- 
bility measure will have an optional Polya tree distribution with the 
same specification for a\ 's as in the original optional Polya tree distribution 
for Q. 

Furthermore, in this example if we use r = 2 to specify a prior distri- 
bution for Bayesian inference of Q, then for any A £ A k , the inference for 
the conditional probability #{(^4) will follow a classical binomial Bayesian 
inference with the Jeffrey's prior Beta ^). 
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3. Bayesian inference with an optional Polya tree prior. Suppose we 
have observed x = {x\, X2, . . . ,x n } where independent draws from a 

probability measure Q, where Q is assumed to have an optional Polya tree as 
a prior distribution. In this section we show that the posterior distribution 
of Q given x also follows an optional Polya tree distribution. 

We denote the prior distribution for q = ^2 by vr(-). For any 4c0, we 
define x(A) = {x^ € x : xi G A} and n(A) = #(x.(A)) = cardinality of the set 
x.(A). Let 

q{x) = ~~^~{ x ) for x G 

and 

q(x\A) = for xeA; 

then the likelihood for x and the marginal density for x can be written, 
respectively, as 



P(x\Q)=f[q(xi)=q(x) 
P(x) = J q(x)dir(q). 



The variable q (or Q) represents the whole set of random variables, that is, 
the stopping variable S(A), the selection variable J (A) and the condition 
probability allocation 9j(A), etc., for all regions A generated during the 
generation of the random probability measure Q. 

In what follows, we assume that the stopping rule needed for Q is an 
independent stopping rule. By considering how is partitioned and how 
probabilities are assigned to the parts of this partition, we have 



'K-> 



(3.1) ?(x) = 5u(x) + (1 - S) (JI(0/) n< J 9(x|N J = n 

In this expression: 

(i) u(x) = Yli=i u ( x i) where u(x) = 1S the uniform density on Q. 

(ii) S = S(p,) is the stopping variable for Q. 

(iii) J is the choice of partitioning to use on O. 

(iv) = (n(ttf), . . . ,n{Q J K j)) is the counts of observations in x falling 
into each part of the partition J. 
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To understand g , (x|N ,/ = n- 7 ), suppose J = j specifies a partition O = fi^ U 
^2 U • • • U fi^- ; then the sample x is partitioned accordingly into subsamples, 

x = x(!l J 1 )U'-Ux(fi , )fj ). 

Under Q, if the subsample sizes n\, . . . ,n J Kj are given, then the positions of 

points in x(^) within are generated independently of those in the other 
subregions. Thus 

Ki 

ff(x|N J = nO = n9W n ?)N)' 

i=i 

where 

9 (x(n?)|«J)= n 

Note that once J = j is given, g(-|f^) is generated independently as an 
optional Polya tree according to the parameters p, X,a that are relevant 
within fi?. We denote by 3>(f^) the expectation of g(x(fi^)|fij) under this 
induced optional Polya tree within {l? . 

In fact, for any A C UfcLi A fc , we have an induced optional Polya tree 
distribution tta(q) f° r the conditional density g(-|A), and we define 

*(A) = y g(x(A)|A)d7r A (g), 
if x(A) ^ and = 1 if x(A) = 0. Similarly, we define 
*o(A)=u(x(A)\A) = J] n(x|A) 

and $ (A) = 1 if x(A) = 0. Note that P(x) = $(0) and u(x) = $ (^). 

Next, we successively integrate out [w.r.t. vr(-)] the random variables in 
the right-hand side of (3.1) according to the order q(x\n J ), 6 J , J and S (last). 
This gives us 

(3.2) $(0) = p$ (n) + (l-p)J2 Ai U $ (^')' 

j=l [ ' i=l 

where D{t) = Y{t x ) ■ ■ ■ T{t k )/F(h + --- + t k ). 

Similarly, for any A 6 Ufc=i ^ with x(A) ^ 0, we have 

(3.3) = p$ (A) + (1 - p) ^ Xj II $ (^)' 

i=i ^ J i=i 
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where in? is the vector of counts in the partition A = [jS A\ , and M, Ri , p, A J ' , 
o? , etc., all depend on A. We note that in the special case when the choice 
of splitting variables are nonrandom, a similar recursion was given in [12]. 

We can now read off the posterior distribution of S = S{0) from equation 
(3.2) by noting that the first term p&o(Q) and the remainder in the right- 
hand side of (3.2) are, respectively, the probabilities of the events 

{stopped at f2, generate x from u(-)} 

and 

{not stopped at fi, generate x by one of the M partitions}. 

Thus S ~ Bernoulli with probability p<&o(£l)/&(Q). Similarly, the j'th term 
in the sum (over j) appearing in the right-hand side of (3.2) is the probability 
of the event 

{not stopped at fi, generate x by using the jth way to partition f2}. 

Hence, conditioning on not stopping at J takes value j with probability 
proportional to 



Finally, given J = j, the probabilities assigned to the parts of this partition 
are O 1 whose posterior distribution is Dirichlet (n-? +a J ). 

By similar reasoning, we can also read off the posterior distribution of 
S = S(A),J = J(A),9 j = j (A) from (3.3) for any A C A k . Thus we have 
proven the following. 

Theorem 3. Suppose x = (xi, . . . , x n ) are independent observations from 
Q where Q has a prior distribution ir(-) that is an optional Poly a tree with 
independent stopping rule, and satisfying the condition of Theorem 2, the 
conditional distribution of Q given X = x is also an optional Polya tree 
where, for each A C ^4°° , the parameters are given as follows: 

1. Stopping probability: 

p(A\x)=p(A)$o(A)/HA)- 

2. Selection probabilities: 

i\ I<J 
a.-)) 

D{a.i~ 



D(n? + a*) t 

P(J = J |x)ocA J ^" +gL> ng(4), j = l,...,M. 



3. Allocation of probability to subregions: the probabilities Q\ for subregion 
A\ , i = 1, . . . , K J are drawn from Dirichlet (n J + a?). 
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In the above, it is understood that M, K 3 , A j , n J , a J all depend on A. 

We use the notation tt(-\xi,X2, ■ ■ ■ ,x n ) to denote this posterior distribu- 
tion for Q. 

To use Theorem 3, we need to compute &(A) for A S A°°. This is done 
by using the recursion (3.3), which says that $(•) is determined for a region 
A if it is first determined for all subregions A\ . By going into subregions of 
increasing levels of depth, we will eventually arrive at some regions having 
certain simple relations with the sample x. We can often derive close form 
solutions for $(•) for such "terminal regions" and hence determine all the 
parameters in the specifications of the posterior optional Polya tree by a 
finite computation. We give two examples. 

Example 3 (2 P contingency table). Let Q = {1, 2} x {1, 2} x • • • x {1, 2} 
be a table with 2 P cells. Let x= (x±,X2, ■ ■ ■ ,x n ) be n independent obser- 
vations where each Xi falls into one of the 2 P cells according to the cell 
probabilities {q{y) : y £ $7}. Assume that q has an optional Polya tree distri- 
bution according to the partitioning scheme in Example 1 where Xj = 4j if 
there are M variables still available for further splitting of a region A, and 

= i '■ = 1,2. Finally, assume that p(A) = p where p € (0, 1) is a constant. 

In this example, there are three types of terminal regions. 

1. A contains no observation. In this case, &(A) = 1. 

2. A is a single cell (in the TP table) containing any number of observations. 
In this case, <&(A) = 1. 

3. A contains exactly one observation, and A is a region where M of the p 
variables are still available for splitting. In this case, 



where vrj\,/(-) is the optional Polya tree on a 2 M table. By recursion (3.3) 




we have 




p2 



M 



+ (1 





= 2 



—M 
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Example 4. Q is a bounded rectangle in W with a partitioning scheme 
as in Example 2. Assume that for each region, one of the p variables is chosen 
to split it (Xj = -), and that a\ = |,z = 1,2. Assume p(A) is a constant, 
p G (0,1). In this terminal region A contains either no observations 

[then &(A) = 1] or a single observation x £ A. In the latter case, 



and 



*(A) = r A (x)= / g(x|A)d7r A (Q) 

J A 



p 1* 13(3/2,1/2) 
{ P, p^B(l/2A" 



P 



'p^ 5(1/2,1/2) '^w 



(x) 



//(A) ' v ~ r/ 2' A l{x) 



where i(x) = 1 or 2 according to whether x ^ A\ or A^. Since m(^i) 
^(^2) = 2^(^) ^ or * ne Lebesgue measure, we have 



r A (x) 



^) +(1 "^ 



p 



[i+(i-p)+(i- P ) 2 + 



1 

-pjAY 

Example 5. £1 is a bounded rectangle in R p . At each level, we split the 
regions according to just one coordinate variable, according to a predeter- 
mined order; for example, coordinate variable Xi is used to split all regions 
at the fcth step whenever k = i (mod p). In this case, &(A) for terminal 
regions are determined exactly as in Example 4. By allowing only one way 
to split a region, we sacrifice some flexibility in the resulting partition in 
exchange for a great reduction of computational complexity. 



Our final result in this section shows that optional Polya tree priors lead 
to posterior distributions that are consistent in the weak topology. For any 
probability measure Qo on fi, a weak neighborhood U of Qo is a set of 
probability measures of the form 



U 



Q: 



9i(-)dQ- / gi(-)dQ 



1,2, 



where gi(-) is a bounded continuous function on f2. 
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Theorem 4. Let x±,X2,--- be independent, identically distributed vari- 
ables from a probability measure Q, tt(-) and ir(-\xi, . . . ,x n ) be the prior and 
posterior distributions for Q as defined in Theorem 3. Then, for any Qq with 
a bounded density, it holds with Qq°°^ probability equal to 1 that 

tt(U\x 1 ,. ..,x n ) — > 1 

for all weak neighborhoods U of Qq. 

Proof. It is a consequence of Schwarz's theorem [20] that the posterior 
is weakly consistent if the prior has positive probability in Kullback-Leibler 
neighborhoods of the true density [10], Theorem 4.4.2. Thus, by the same 
argument as in Theorem 2, we only need to show that it is possible to ap- 
proximate a bounded density in Kullback-Leibler distance by step functions 
on a suitably refined partition. 

Let / be a density satisfying sup^gQ f(x) < M < oo. First assume that / 
is continuous with modulus of continuity 5(e). Let Ut=i-^» t> e a recursive 
partition of satisfying A{ E A k and diameter (Ai) < e. Let 

ft = sup /(x), g(x) =^2 i g i I Ai {x) 

and G = J g(x) dfi. We claim that as e — > 0, the density g/G approximates 
/ arbitrarily well in Kullback-Leibler distance. To see this, note that 

0<G-1 = j\g-f)d f , = Y / j^(g(x)-f(x))d f i 



Hence 



0< / f log( f/(g/G))dfi 



f\og(f/g)dfi + J /logGd/x 



<log(G)<log(l + <$(£)/*(«))■ 
Finally, if / is not continuous, we can find a set B C £1 with fJ.(B c ) < e' such 
that / is uniformly continuous on B. Then 

(g-f)dfi= [ (g-f)dn+ [ (g-f)dn 
Jb Jb c 

<5(e)fi{n)+Me' 
and the result still holds. □ 
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4. Density estimation using an optional Polya tree prior. In this section 
we develop and test the methods for density estimation using an optional 
Polya tree prior. Two different strategies are considered. The first is through 
computing the posterior mean density. The other is a two-stage approach — 
first learn a fixed tree topology that is representative of the underlying 
structure of the distribution, and then compute a piecewise constant es- 
timate conditional on this tree topology. Our numerical examples start with 
the one-dimensional setting to demonstrate some of the basic properties of 
optional Polya trees. We then move onto the two-dimensional setting to pro- 
vide a flavor of what happens when the dimensionality of the distribution 
increases. 

4.1. Computing the mean. For the purpose of demonstration, we first 
consider the situation described in Example 2 with p = 1 where the state 
space is the unit interval and the splitting point of each elementary region 
(or tree node) is the middle point of its range. In this simple scenario, each 
node has only one way to divide, so the only decision to make is whether to 
stop or not. Each point x in the state space ft belongs to one and only one 
elementary region in A k for each k. In this case, the posterior mean density 
function can be computed very efficiently using an inductive procedure. (See 
the Appendix for details.) 

In a multi-dimensional setting with multiple ways to split at each node, 
the sets in each A k could overlap, and so the computation of the posterior 
mean is more difficult. One way to get around this problem is to place 
some restriction on how the elementary regions can split. For example, an 
alternate splitting rule requires that each dimension is split in turn (Example 
5). This limits the number of choices to split for each elementary region 
to one and effectively reduces the dimensionality of the problem to one. 
However, in restricting the ways to divide, one wastes a lot of computation 
on cutting dimensions that need not be cut which affects the variability 
of the estimate significantly. We demonstrate this phenomenon in our later 
examples. 

Another way to compute (or at least approximate) the posterior mean 
density is first explored by Hutter [12]. For any point x € O, Hutter proposed 
computing &(Q\x,D) and using <3?(f2|x, D)/$>(£l\D) as an estimate of the 
posterior mean density at x. [Here D represents the observed data; 5>(fJ|D) 
denotes the $ computed for the root node given the observed data points 
and $(0|x, D) is computed treating x as an extra data point observed.] 
This method is general but computationally intensive, especially when there 
are multiple ways to divide each node. Also, because this method is for 
estimating the density at a specific point, to investigate the entire function 
one must evaluate $(0|x, D) on a grid of x values which makes it even more 
unattractive computationally. For this reason, in our later two-dimensional 
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examples we only use the restriction method discussed above to compute 
the posterior mean. 

4.2. The hierarchical MAP method. Another approach for density esti- 
mation using an optional Polya tree prior is to proceed in two steps — first 
learn a "good" partition or tree topology over the state space, and then esti- 
mate the density conditional on this tree topology. The first step reduces the 
prior process from an infinite mixture of infinite trees to a fixed finite tree. 
Given such a fixed tree topology (i.e., whether to stop or not at each step, 
and if not, which way to divide), we can easily compute the (conditional) 
mean density function. The posterior probability mass over each node is 
simply a product of Beta means, and the distribution within those stopped 
regions is uniform by construction. So the key lies in learning a reliable tree 
structure. In fact, learning the tree topology is useful beyond facilitating 
density estimation. A representative partition over the state space by itself 
sheds light on the underlying structure of the distribution. Such information 
is particularly valuable in high-dimensional problems where direct visualiza- 
tion of the data is difficult. 

Because a tree topology depends only on the decisions to stop and the 
ways to split, its posterior probability is determined by the posterior p's and 
A's. The likelihood of each fixed tree topology is the product of a sequence 
of terms in the form, p, 1 — p, Afc, depending on the stopping and splitting 
decisions at each node. One seemingly obvious candidate tree topology for 
representing the data structure is the maximum a posteriori (MAP) topol- 
ogy, that is, the topology with the highest posterior probability. However, in 
this setting the MAP topology often does not produce the most descriptive 
partition for the distribution. It biases toward shorter tree branches in that 
deeper tree structures simply have more terms less than 1 to multiply into 
their posterior probability. While the data typically provide strong evidence 
for the stopping decisions (and so the posterior p's for all but the very deep 
nodes are either very close to 1 or very close to 0), this is not the case for 
the A's. It occurs often that for an elementary region the data points are 
distributed relatively symmetrically in two or more directions, and thus the 
posterior A's for those directions will be much less than 1. As a consequence, 
deep tree topologies, even if they reflect the actual underlying data struc- 
ture, often have lower posterior probabilities than shallow trees do. (This 
failure of the MAP estimate relates more generally to the multi-modality of 
the posterior distribution as well as the self-similarity of the prior process 
and deserves more studies in its own right.) 

We propose the construction of the representative tree topology through 
a simple top-down sequential procedure. Starting from the root node, if the 
posterior p > 0.5 then we stop the tree; otherwise we divide the tree in the 
direction k that has the highest A&. (When there is more than one direction 
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with the same highest the choice among them is arbitrary.) Then we 
repeat this procedure for each Aj, until all branches of the tree have been 
stopped. This can be viewed as a hierarchical MAP decision procedure — 
with each MAP decision being made based on those made in the previous 
steps. In the context of building trees, this approach is natural in that it 
exploits the hierarchy inherent in the problem. 

4.3. Numerical examples. Next we apply the optional Polya tree prior 
to several examples of density estimation in one and two dimensions. We 
consider the situation described in Example 2 with p = 1 and 2 where the 
state space is the unit interval [0, 1] and the unit square [0, 1] x [0, 1], respec- 
tively. The cutting point of each coordinate is the middle point of its range 
for the corresponding elementary region. For all the optional Polya tree pri- 
ors used in the following examples, the prior stopping probability p = 0.5 
and the prior pseudo-count a = 0.5 for all elementary regions. The standard 
Polya tree priors examined (as a comparison) have quadratically increasing 
pseudo-counts a = depth 2 (see [8] and [13]). For numerical purpose, we stop 
dividing the nodes if their support is under a certain threshold which we 
refer to as the precision threshold. We used 10~ 6 as the precision threshold 
in the one-dimensional examples and 10 -4 in the two-dimensional examples. 
Note that in the ID examples, each node has only one way to divide, and so 
we can use the inductive procedure described in the Appendix to compute 
the posterior mean density function. For the 2D examples, we implemented 
and tested the full optional tree as well as a restricted version based on 
"alternate cutting" (see Example 5). 

Example 6 (Mixture of two close spiky uniforms). We simulate data 
from the following mixture of uniforms: 

0.5*7(0.23, 0.232) + 0.5C/(0.233, 0.235) 

and we apply three methods to estimate the density function. The first is to 
compute the posterior mean density using an optional Polya tree prior. The 
second is to apply the hierarchical MAP method using an optional Polya tree 
prior. The third is to compute the posterior mean using a standard Polya 
tree prior. The results are presented in Figure 1. Several points can be made 
from this figure. (1) A sample size of 500 is sufficient for the optional tree 
methods to capture the boundaries as well as the modes of the uniform distri- 
butions whereas the Polya tree prior with quadratic pseudo-counts requires 
thousands of data points to achieve this. (2) With increasing sample size, 
the estimates from the optional Polya tree methods become smoother, while 
the estimate from the standard Polya tree with quadratic pseudo-counts is 
still "locally spiky" even for a sample size of 10 5 . (This problem can be 
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P-M OPTsiI<s=100 



H-MAP QF>TsiZf5=1Q0 



P-M PTsize=100 



S3 - 



— r 
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3 - 



P-M OPTsize=500 H-MAP OPTsize=50O P-M PTsize=50Q 




0.22S 0-232 0.236 0.22« 0.232 0.236 0.22* 0.232 0.236 

x x K 



P-M OPT size=250Q H-MAP OPT size=2500 P-M PT size-2500 




D.iti 0.232 0.236 »j» 0.232 0.236 0.233 0.232 0.233 

I ■ X 



P-M OPT siie= 12500 



H-MAP OPTsize=1250Q 



P-M PT size=12500 



P-M OPTsize=1&+05 



H-MAP OPTsfze=1e+05 



P-M PTsize=1e<-05 



8 = 



no. 



Fig. 1. Density estimation for 0.5(7(0.23, 0.232) + 0.517(0.233,0.235). The five rows rep- 
resent five different sample sizes n = 100, 500, 2500, 12,500 and 100,000. The first column 
corresponds to the posterior mean approach using an optional Polya tree prior. The second 
column corresponds to the hierarchical MAP method using an optional Polya tree prior. 
The green ticks along the top margins of this column indicate the partition learned from 
this method. The third column corresponds to the posterior mean approach using a stan- 
dard Polya tree prior with a = depth 2 . The red dashed lines in all plots represent the true 
density function. 



remedied by increasing the prior pseudo-counts faster than the quadratic 
rate at the price of further loss of flexibility.) (3) The hierarchical MAP 
method performs just as well as the posterior mean approach even though 
it requires much less computation and memory. (4) The partition learned in 
the hierarchical MAP approach reflects the structure of the distribution. 
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Fig. 2. Density estimation for 0.7 Beta(40, 60) + 0.3 Beta(2000, 1000) . The five rows rep- 
resent five different sample sizes n = 100, 500, 2500, 12,500 and 100,000. The first column 
corresponds to the posterior mean approach using an optional Polya tree prior. The second 
column corresponds to the hierarchical MAP method using an optional Polya tree prior. 
The green ticks along the top margins of this column indicate the partition learned from 
this method. The third column corresponds to the posterior mean approach using a stan- 
dard Polya tree prior with a = depth 2 . The red dashed lines in all plots represent the true 
density function. 



Example 7 (Mixture of two Betas). Next we apply the same three meth- 
ods to simulated samples from a mixture of two Beta distributions, 

0.7 Beta(40, 60) + 0.3 Beta(2000, 1000) . 

The results are given in Figure 2. Both the optional and the standard Polya 
tree methods do a decent job in capturing the locations of the two mixture 
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components (with smooth boundaries). The optional Polya tree does quite 
well with just 100 data points. 

Example 8 (Mixture of Uniform and "semi-Beta" in the unit square). 
In this example, we consider a mixture distribution over the unit square 
[0, 1] x [0, 1]. The first component is a uniform distribution over [0.78, 0.80] x 
[0.2,0.8]. The second component has support [0.25,0.4] x [0,1] with X be- 
ing uniform over [0.25,0.4] and Y being Beta(100, 120), independent of 
each other. The mixture probability for the two components is (0.35,0.65). 
Therefore, the actual density function of the distribution is 

°- 35 , °- 65 „ r ( 22 °) ,S»n „.Mi9 n 



0.012 " 1 [o-78,o.8o] x [0.2,0.8] + ^5 x r(i20)r(ioo) y ^ ' i [°- 25 '°- 4 ] >< [ ' 1 ]- 

We apply the following methods to estimate this density — (1) the posterior 
mean approach using an optional Polya tree prior with the alternate cutting 
restriction (Figure 3); (2) the hierarchical MAP method using an optional 
Polya tree prior with the alternate cutting restriction (Figure 4); and (3) 
the hierarchical MAP method using an optional Polya tree prior without 
any restriction on division (Figure 5). The last method does a much better 
job in capturing the underlying structure of the data, and thus requires a 
much smaller sample size to achieve decent estimates of the density. 

Example 9 (Bivariate normal). In our last example, we apply the hier- 
archical MAP method using an optional Polya tree prior to samples from a 
bivariate normal distribution, 

BN ,/0.6\ /O.l 2 



0.4^'^ 0.1 2 

This example demonstrates how the posterior optional Polya tree behaves 
in a multi-dimensional setting when the underlying distribution has smooth 
boundary (Figure 6). Not surprisingly, the gradient or change in density is 
best captured when its direction is perpendicular to one of the coordinates 
(and thus is parallel to the other in the 2D case). 

5. Concluding remarks. In this paper we established the existence and 
the theoretical properties of absolutely continuous probability measures ob- 
tained through the Introduction of randomized splitting variables and early 
stopping rules into a Polya tree construction. For low-dimensional densities, 
it is possible to carry out exact computation to obtain posterior inferences 
based on this "optional Polya tree" prior. A conceptually important feature 
of this approach is the ability to learn the partition underlying a piecewise 
constant density in a principled manner. Although the theory was motivated 
by applications in high-dimensional problems, at present exact computation 
is too demanding for such applications. The development of effective ap- 
proximate computation should be a priority in future works. 
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(c) Sample size = 1000 



(d) Sample size = 5000 



Fig. 3. Density estimate for a mixture of uniform and "semi-Beta" using the posterior 
mean approach for an optional Polya tree with the restriction of "alternate cutting. " The 
white blocks represent the density estimates falling outside of the intensity range plotted. 



APPENDIX 

Here we describe an inductive procedure for computing the mean density 
function of an optional Polya tree when the way to divide each elementary 
region is dichotomous and unique. 

Let Ai denote a level- i elementary region and (k\, . . . , hi) the sequence 
of left and right decisions to reach Ai from the root node CI. That is, 
Ai = fifofoj...*-, where the k's take values in {0, 1} indicating left and right, 
respectively. For simplicity, we let Aq = ft represent the root node. Now for 
any point let {Ai} be the sequence of nodes such that x E USo^- 
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la 7fl 

(c) Sample size = 1000 (d) Sample size = 5000 

Fig. 4. Density estimate for a mixture of uniform and "semi-Beta" by the hierarchical 
MAP method using an optional Polya tree prior with the restriction of "alternate cutting. " 
The dark lines mark the representative partition learned from the method. The white blocks 
represent the density estimates falling outside of the intensity range plotted. 




Assuming ft(Ai) \.0, the density of the mean distribution at x is given by 

lim EP(X G Ai)/fi(Ai). 

i— >oo 

Therefore, to compute the mean density we just need a recipe for computing 
EP(X € Ai) for any elementary region A4. To achieve this goal, first let A\ 
be the sibling of Ai for all i > 1. That is, 

A\ = ^k' 1 k' 2 ---k'. where k'- = kj for j = 1, 2, . . . ,i — 1 and k' i = 1 — k{. 
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(a) Sample size == 100 (b) Sample size = S00 




(c) Sample size - 1000 (d) Sample size - 5000 

Fig. 5. Density estimate for a mixture of uniform distribution and "semi-Beta" dis- 
tribution by the hierarchical MAP method using an optional Polya tree prior (with no 
restriction on division). The dark lines mark the representative partition learned from the 
method. The white blocks represent the density estimates falling outside of the intensity 
range plotted. 

Next, for i > 1, let on and a[ be the Beta parameters for node ^4j_i associated 
with its two children Ai and A\. Also, for i > 0, let pi be the stopping 
probability of A%, and Si the event that the tree has stopped growing on or 
before reaching node A{. With this notation, we have for all i > 1, 

EP(X£ Ai)l(Si) 

= EP(X G ^)l(5i_i) + EP(X e AjliSf^USi) 





Fig. 6. The hierarchical MAP method using an optional Poly a tree prior applied to sam- 
ples from a bivariate normal distribution BN((0.4, 0.6), 0.1 2 /) . 



1 EP(XgA-i)1(5,_0 



a. 



a-i + a[ 

and 

EP(X € A^)l(Sf) = € A i )l(Sf)l(5f_ 1 ] 



1 ^l-^spcxE^-Oicsti). 
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Now let a t = EP(X G Ai)l(Si) and b t = EP(X G Aj)l(5f), then the above 
equations can be rewritten as 

MA.) 



(A.l) 



a-i-l H : — fpibi-i, 



cti + a 



oii + a 



7 (1-Pi)bi- 



i- 



for all i > 1. Because a D = EP(X G fi)l(So) = P(Sq) = POi and 6o = 1 — ao = 
1 — po, we can apply (A.l) inductively to compute the ai and bi for all ^4j's. 
Because EP(X G Aj) = Oj + bi, the mean density at x is given by 

lim EP(X G Ai)fn(Ai) = lim (a, + bi)/^). 
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